Lab Assignment 4: Spatial Predictive Analysis

Spatial Predictive Analysis of Vacant and Abandoned Buildings

Author

Ye Zhang

Published

November 16, 2025

Introduction

In this analysis, I will build a spatial predictive model for vacant and abandoned buildings in Chicago. Abandoned buildings are a critical urban issue that affects neighborhood stability, property values, public safety, and community well-being. By analyzing the spatial patterns of reported vacant and abandoned buildings, we can better understand where these issues concentrate and potentially improve city intervention strategies.

Setup

Code
# Load required packages
library(tidyverse)      # Data manipulation
library(sf)             # Spatial operations
library(here)           # Relative file paths
library(viridis)        # Color scales
library(spdep)          # Spatial dependence
library(FNN)            # Fast nearest neighbors
library(MASS)           # Negative binomial regression
library(patchwork)      # Plot composition
library(knitr)          # Tables
library(kableExtra)     # Table formatting
library(spatstat.geom)    # Spatial geometries
library(spatstat.explore) # Spatial exploration/KDE
library(lubridate)      # Date handling
library(terra)          # Raster operations

# Set options
options(scipen = 999)  # No scientific notation
set.seed(5080)         # Reproducibility

# Create consistent theme for visualizations
theme_buildings <- function(base_size = 11) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold", size = base_size + 1),
      plot.subtitle = element_text(color = "gray30", size = base_size - 1),
      legend.position = "right",
      panel.grid.minor = element_blank(),
      axis.text = element_blank(),
      axis.title = element_blank()
    )
}

# Set as default
theme_set(theme_buildings())

cat("✓ All packages loaded successfully!\n")
✓ All packages loaded successfully!

Part 1: Data Loading & Exploration

Load Chicago Spatial Boundaries

Code
# Load police districts - we'll use these for cross-validation later
policeDistricts <- 
  st_read("https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = dist_num)
Reading layer `OGRGeoJSON' from data source 
  `https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON' 
  using driver `GeoJSON'
Simple feature collection with 25 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
Geodetic CRS:  WGS 84
Code
# Load police beats - smaller units within districts
policeBeats <- 
  st_read("https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(Beat = beat_num)
Reading layer `OGRGeoJSON' from data source 
  `https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON' 
  using driver `GeoJSON'
Simple feature collection with 277 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
Geodetic CRS:  WGS 84
Code
# Load Chicago boundary - the city outline
chicagoBoundary <- 
  st_read("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson") %>%
  st_transform('ESRI:102271')
Reading layer `chicagoBoundary' from data source 
  `https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson' 
  using driver `GeoJSON'
Simple feature collection with 1 feature and 1 field
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -87.8367 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
Geodetic CRS:  WGS 84
Code
cat("✓ Loaded spatial boundaries\n")
✓ Loaded spatial boundaries
Code
cat("  - Police districts:", nrow(policeDistricts), "\n")
  - Police districts: 25 
Code
cat("  - Police beats:", nrow(policeBeats), "\n")
  - Police beats: 277 

Load Vacant and Abandoned Buildings Data

Code
# Load vacant and abandoned buildings data
building_raw <- read_csv(here("data", "311_Service_Requests_-_Vacant_and_Abandoned_Buildings_Reported_-_Historical_20251116.csv"))

# Check column names
cat("Column names in dataset:\n")
Column names in dataset:
Code
print(names(building_raw))
 [1] "SERVICE REQUEST TYPE"                                                 
 [2] "SERVICE REQUEST NUMBER"                                               
 [3] "DATE SERVICE REQUEST WAS RECEIVED"                                    
 [4] "LOCATION OF BUILDING ON THE LOT (IF GARAGE, CHANGE TYPE CODE TO BGD)."
 [5] "IS THE BUILDING DANGEROUS OR HAZARDOUS?"                              
 [6] "IS BUILDING OPEN OR BOARDED?"                                         
 [7] "IF THE BUILDING IS OPEN, WHERE IS THE ENTRY POINT?"                   
 [8] "IS THE BUILDING CURRENTLY VACANT OR OCCUPIED?"                        
 [9] "IS THE BUILDING VACANT DUE TO FIRE?"                                  
[10] "ANY PEOPLE USING PROPERTY? (HOMELESS, CHILDEN, GANGS)"                
[11] "ADDRESS STREET NUMBER"                                                
[12] "ADDRESS STREET DIRECTION"                                             
[13] "ADDRESS STREET NAME"                                                  
[14] "ADDRESS STREET SUFFIX"                                                
[15] "ZIP CODE"                                                             
[16] "X COORDINATE"                                                         
[17] "Y COORDINATE"                                                         
[18] "Ward"                                                                 
[19] "Police District"                                                      
[20] "Community Area"                                                       
[21] "LATITUDE"                                                             
[22] "LONGITUDE"                                                            
[23] "Location"                                                             
Code
# Convert to spatial object
buildings <- building_raw %>%
  # Remove rows without coordinates
  filter(!is.na(`X COORDINATE`), !is.na(`Y COORDINATE`)) %>%
  # Convert to sf object - coordinates are in State Plane
  st_as_sf(coords = c("X COORDINATE", "Y COORDINATE"), crs = 3435) %>%  
  # Transform to the CRS used in the exercise (ESRI:102271)
  st_transform('ESRI:102271')

# Check the data
cat("\n✓ Loaded abandoned buildings data\n")

✓ Loaded abandoned buildings data
Code
cat("  - Number of building reports:", nrow(buildings), "\n")
  - Number of building reports: 65082 
Code
cat("  - CRS:", st_crs(buildings)$input, "\n")
  - CRS: ESRI:102271 

Loading the 311 service requests for vacant and abandoned buildings and converting them to a spatial format that matches our other Chicago data.

Visualize Building Distribution

Code
# Simple point map
p1 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_sf(data = buildings, color = "#d62828", size = 0.1, alpha = 0.4) +
  labs(
    title = "Vacant & Abandoned Building Locations",
    subtitle = paste0("Chicago, n = ", format(nrow(buildings), big.mark = ","))
  )

# Density surface using modern syntax
buildings_coords_plot <- as.data.frame(st_coordinates(buildings))
names(buildings_coords_plot) <- c("X", "Y")

p2 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_density_2d_filled(
    data = buildings_coords_plot,
    aes(x = X, y = Y),
    alpha = 0.7,
    bins = 8
  ) +
  scale_fill_viridis_d(
    option = "plasma",
    direction = -1,
    guide = "none"
  ) +
  labs(
    title = "Density Surface",
    subtitle = "Kernel density estimation"
  )

# Combine plots using patchwork
p1 + p2 + 
  plot_annotation(
    title = "Spatial Distribution of Vacant & Abandoned Buildings in Chicago",
    tag_levels = 'A'
  )

Vacant and abandoned buildings are not evenly distributed across Chicago. The density surface reveals distinct concentration patterns, with hot spots in specific neighborhoods: there are clear clusters in the central area and, most prominently, in the southern part of the city.

Part 2: Create Fishnet Grid

Create the Grid

Code
# Create a 500m x 500m grid over Chicago
fishnet <- st_make_grid(
  chicagoBoundary,
  cellsize = 500,  # Each cell is 500 meters on each side
  square = TRUE
) %>%
  st_sf() %>%
  mutate(uniqueID = row_number())

# Keep only cells that actually overlap with Chicago
fishnet <- fishnet[chicagoBoundary, ]

cat("✓ Created fishnet grid\n")
✓ Created fishnet grid
Code
cat("  - Number of cells:", nrow(fishnet), "\n")
  - Number of cells: 2458 
Code
cat("  - Cell size: 500 x 500 meters\n")
  - Cell size: 500 x 500 meters

Creating a grid of 500m x 500m squares that covers Chicago. This converts our point data into a regular grid that’s easier to analyze and model.

Aggregate Buildings to Grid Cells

Code
# Count how many building reports fall in each grid cell
buildings_fishnet <- st_join(buildings, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(countBuildings = n())

# Join counts back to the grid (cells with 0 get NA, so we replace with 0)
fishnet <- fishnet %>%
  left_join(buildings_fishnet, by = "uniqueID") %>%
  mutate(countBuildings = replace_na(countBuildings, 0))

# Summary statistics
cat("\nBuilding count distribution:\n")

Building count distribution:
Code
summary(fishnet$countBuildings)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    1.00    7.00   26.47   28.00  378.00 
Code
cat("\nCells with zero buildings:", 
    sum(fishnet$countBuildings == 0), 
    "/", nrow(fishnet),
    "(", round(100 * sum(fishnet$countBuildings == 0) / nrow(fishnet), 1), "%)\n")

Cells with zero buildings: 542 / 2458 ( 22.1 %)

Counting how many abandoned building reports occurred in each grid cell.

Visualize the Grid

Code
# Map the counts in each grid cell
ggplot() +
  geom_sf(data = fishnet, aes(fill = countBuildings), color = NA) +
  scale_fill_viridis_c(
    name = "Abandoned\nBuildings",
    option = "plasma",
    trans = "sqrt"  # Square root transformation to see variation better
  ) +
  labs(
    title = "Vacant & Abandoned Buildings by 500m Grid Cell",
    subtitle = "Chicago"
  ) +
  theme_buildings()

The fishnet grid reveals clear spatial concentration of abandoned building reports. The highest counts (brightest colors) cluster in the southern part of the city. Many cells have zero reports, while some hot spot cells have markedly elevated counts, indicating neighborhoods with persistent vacancy and abandonment issues.

Part 3: Spatial Features

Calculate k-Nearest Neighbors

Code
# Get coordinates of building points
buildings_coords <- st_coordinates(buildings)

# Get coordinates of grid cell centroids
fishnet_coords <- st_coordinates(st_centroid(fishnet))

# For each grid cell, find average distance to 5 nearest buildings
nn_distances <- get.knnx(
  data = buildings_coords,  
  query = fishnet_coords,   
  k = 5                    
)

# Add to fishnet
fishnet <- fishnet %>%
  mutate(
    buildings.nn = rowMeans(nn_distances$nn.dist)
  )

cat("✓ Calculated k-nearest neighbors\n")
✓ Calculated k-nearest neighbors
Code
cat("  - k = 5 nearest buildings\n")
  - k = 5 nearest buildings
Code
cat("  - Mean distance:", round(mean(fishnet$buildings.nn), 1), "meters\n")
  - Mean distance: 291.5 meters

For each grid cell, we find the average distance to the 5 nearest abandoned building reports. This measures how close each cell is to existing abandoned buildings.

Visualize k-NN Feature

Code
ggplot() +
  geom_sf(data = fishnet, aes(fill = buildings.nn), color = NA) +
  scale_fill_viridis_c(
    name = "Mean Distance\nto 5 Nearest\nBuildings (m)",
    option = "viridis",
    direction = -1
  ) +
  labs(
    title = "Average Distance to Nearest Abandoned Buildings",
    subtitle = "Lower values = closer to existing abandoned buildings"
  ) +
  theme_buildings()

The k-NN map's consistent yellow (short-distance) coloring across most of the city suggests that abandoned buildings are a pervasive issue affecting nearly all neighborhoods rather than being isolated to a few problem areas.

Part 4: Local Moran’s I Analysis

Calculate Local Moran’s I

Code
# Create spatial weights matrix (k = 4 nearest neighbors)
coords <- st_centroid(fishnet) %>% st_coordinates()
weight_matrix <- knn2nb(knearneigh(coords, k = 4))
weights <- nb2listw(weight_matrix, style = "W", zero.policy = TRUE)

# Calculate Local Moran's I once, then extract the statistic and p-value
localMorans_result <- localmoran(fishnet$countBuildings, weights, zero.policy = TRUE)

fishnet <- fishnet %>%
  mutate(
    localMorans = localMorans_result[, 1],       # local Moran's I statistic (Ii)
    localMorans_pval = localMorans_result[, 5]   # p-value
  )

# Categorize into hot spots, cold spots, and not significant
fishnet <- fishnet %>%
  mutate(
    cluster = case_when(
      localMorans_pval > 0.05 ~ "Not Significant",
      localMorans > 0 & countBuildings > median(countBuildings) ~ "High-High (Hot Spot)",
      localMorans > 0 & countBuildings <= median(countBuildings) ~ "Low-Low (Cold Spot)",
      localMorans < 0 & countBuildings > median(countBuildings) ~ "High-Low (Outlier)",
      localMorans < 0 & countBuildings <= median(countBuildings) ~ "Low-High (Outlier)",
      TRUE ~ "Not Significant"
    )
  )

# Summary
cat("Cluster distribution:\n")
Cluster distribution:
Code
table(fishnet$cluster)

High-High (Hot Spot)   High-Low (Outlier)   Low-High (Outlier) 
                 270                    6                    2 
     Not Significant 
                2180 

Local Moran’s I identifies clusters of similar values. Hot spots are high-abandonment areas surrounded by high-abandonment neighbors. Cold spots are low-abandonment areas surrounded by low-abandonment neighbors.

Visualize Hot Spots

Code
# Map 1: Cluster types
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = cluster), color = NA) +
  scale_fill_manual(
    name = "Cluster Type",
    values = c(
      "High-High (Hot Spot)" = "#d7191c",
      "Low-Low (Cold Spot)" = "#2c7bb6",
      "High-Low (Outlier)" = "#fdae61",
      "Low-High (Outlier)" = "#abd9e9",
      "Not Significant" = "gray90"
    )
  ) +
  labs(
    title = "Local Moran's I Clusters",
    subtitle = "Spatial autocorrelation of vacant & abandoned buildings"
  ) +
  theme_buildings()

# Map 2: Local Moran's I values
p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = localMorans), color = NA) +
  scale_fill_gradient2(
    name = "Local\nMoran's I",
    low = "#2c7bb6",
    mid = "white",
    high = "#d7191c",
    midpoint = 0
  ) +
  labs(
    title = "Local Moran's I Values",
    subtitle = "Positive = similar neighbors, Negative = dissimilar neighbors"
  ) +
  theme_buildings()

p1 + p2

The hot spot analysis identifies concentrated high-abandonment clusters (red areas) in south Chicago neighborhoods, with a few in the north-central area. Cold spots (blue) would mark areas of consistently low abandonment, but almost none are detected here. The spatial clustering is statistically significant, confirming that building abandonment is not randomly distributed but concentrates in neighborhoods that may be experiencing disinvestment.

Calculate Distance to Hot Spots

Code
# Identify hot spot cells
hotspots <- fishnet %>%
  filter(cluster == "High-High (Hot Spot)")

cat("Number of hot spots identified:", nrow(hotspots), "\n")
Number of hot spots identified: 270 
Code
# Only calculate distances if we have hot spots
if(nrow(hotspots) > 0) {
  # Get coordinates
  hotspot_coords <- st_coordinates(st_centroid(hotspots))
  fishnet_coords <- st_coordinates(st_centroid(fishnet))
  
  # Calculate distance from each cell to nearest hot spot
  nn_to_hotspot <- get.knnx(
    data = hotspot_coords,
    query = fishnet_coords,
    k = 1
  )
  
  # Add to fishnet
  fishnet <- fishnet %>%
    mutate(
      dist_to_hotspot = nn_to_hotspot$nn.dist[,1]
    )
  
  cat("✓ Calculated distance to nearest hot spot\n")
  cat("  - Number of hot spots:", nrow(hotspots), "\n")
  cat("  - Mean distance to hot spot:", round(mean(fishnet$dist_to_hotspot), 1), "meters\n")
} else {
  # If no hot spots, create a placeholder distance variable
  fishnet <- fishnet %>%
    mutate(dist_to_hotspot = 0)
  cat("⚠ No hot spots identified - using placeholder distance variable\n")
}
✓ Calculated distance to nearest hot spot
  - Number of hot spots: 270 
  - Mean distance to hot spot: 3873.7 meters

Measuring how far each grid cell is from the nearest identified hot spot of building abandonment.

Visualize Distance to Hot Spots

Code
if(nrow(hotspots) > 0) {
  ggplot() +
    geom_sf(data = fishnet, aes(fill = dist_to_hotspot), color = NA) +
    scale_fill_viridis_c(
      name = "Distance to\nNearest Hot Spot\n(meters)",
      option = "magma",
      direction = -1
    ) +
    labs(
      title = "Distance to Nearest Abandonment Hot Spot",
      subtitle = "Lower values = closer to hot spots"
    ) +
    theme_buildings()
} else {
  cat("No hot spots to visualize\n")
}

The map shows concentric rings radiating outward from a central hot spot in west-central Chicago; the lightest (near-zero distance) cells mark the core abandonment cluster. The gradient reveals that most of the northern half of the city is relatively close to this hot spot (orange tones), while the far north and some southern areas are more isolated from the main concentration (darker purple), suggesting those neighborhoods may be less affected by spillover from concentrated building vacancy.

Part 5: Count Regression Models

Prepare Modeling Data

Code
# Join spatial features with police district information
fishnet_model <- fishnet %>%
  st_join(policeDistricts, join = st_intersects, left = TRUE) %>%
  st_drop_geometry() %>%
  filter(!is.na(District))  # Remove cells outside districts

# Add scaled versions of predictors (helps with NB convergence)
fishnet_model <- fishnet_model %>%
  mutate(
    buildings.nn_scaled = scale(buildings.nn)[,1],
    dist_to_hotspot_scaled = scale(dist_to_hotspot)[,1]
  )

cat("✓ Prepared modeling dataset\n")
✓ Prepared modeling dataset
Code
cat("  - Number of cells:", nrow(fishnet_model), "\n")
  - Number of cells: 2891 
Code
cat("  - Number of districts:", n_distinct(fishnet_model$District), "\n")
  - Number of districts: 23 

Preparing our data for modeling by joining it with police districts (for cross-validation) and creating scaled versions of our predictor variables. Note that st_join() with st_intersects duplicates cells that straddle district boundaries, which is why the modeling dataset has 2,891 rows even though the fishnet has only 2,458 cells.
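If the duplicated border cells are a concern, one alternative (a sketch, not the approach used above) is sf's `largest = TRUE` option, which assigns each cell only to the district it overlaps most:

```r
# Sketch of an alternative district join that avoids duplicating cells
# straddling district borders: `largest = TRUE` keeps, for each fishnet
# cell, only the district with the greatest area of overlap.
fishnet_model_alt <- fishnet %>%
  st_join(policeDistricts, largest = TRUE) %>%
  st_drop_geometry() %>%
  filter(!is.na(District))

# nrow(fishnet_model_alt) should then match the number of fishnet cells
```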

Fit Poisson Regression

Code
# Fit Poisson model
model_poisson <- glm(
  countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
  data = fishnet_model,
  family = poisson()
)

# Summary
summary(model_poisson)

Call:
glm(formula = countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled, 
    family = poisson(), data = fishnet_model)

Coefficients:
                       Estimate Std. Error z value            Pr(>|z|)    
(Intercept)             1.06854    0.01458    73.3 <0.0000000000000002 ***
buildings.nn_scaled    -3.36737    0.02359  -142.7 <0.0000000000000002 ***
dist_to_hotspot_scaled -0.92218    0.00901  -102.4 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 151530  on 2890  degrees of freedom
Residual deviance:  23630  on 2888  degrees of freedom
AIC: 33755

Number of Fisher Scoring iterations: 5
Code
# Get AIC - lower is better
cat("\nPoisson Model AIC:", AIC(model_poisson), "\n")

Poisson Model AIC: 33755.26 

Fitting a Poisson regression model that predicts building counts using distance to nearest buildings and distance to hot spots.
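Because the Poisson model uses a log link, the coefficients are on the log scale. A quick sketch (reusing the `model_poisson` object fitted above) converts them to incidence rate ratios, the multiplicative change in the expected count for a one-standard-deviation increase in each scaled predictor:

```r
# Exponentiate the log-scale Poisson coefficients into incidence rate
# ratios (IRRs): values below 1 mean the expected count falls as the
# predictor rises. For example, exp(-3.37) is roughly 0.03, so a one-SD
# increase in mean distance to the 5 nearest buildings multiplies the
# expected count by about 0.03.
round(exp(coef(model_poisson)), 3)
```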

Fit Negative Binomial Regression

Code
# Negative Binomial can have convergence issues with extreme values
# We'll try multiple strategies to fit the model

# First, check the data distribution
cat("Data diagnostics:\n")
Data diagnostics:
Code
cat("Building count range:", min(fishnet_model$countBuildings), "to", max(fishnet_model$countBuildings), "\n")
Building count range: 0 to 378 
Code
cat("Mean:", round(mean(fishnet_model$countBuildings), 2), "\n")
Mean: 26.82 
Code
cat("Variance:", round(var(fishnet_model$countBuildings), 2), "\n")
Variance: 2084.37 
Code
cat("Variance/Mean ratio:", round(var(fishnet_model$countBuildings)/mean(fishnet_model$countBuildings), 2), "\n\n")
Variance/Mean ratio: 77.72 
Code
# Try fitting with robust error handling
model_nb <- tryCatch({
  cat("Trying NB model with scaled predictors...\n")
  glm.nb(
    countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
    data = fishnet_model,
    control = glm.control(maxit = 100)
  )
}, error = function(e1) {
  cat("Direct fit failed. Trying with Poisson starting values...\n")
  
  # Fit Poisson with scaled predictors
  model_poisson_scaled <- glm(
    countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
    data = fishnet_model,
    family = poisson()
  )
  
  tryCatch({
    glm.nb(
      countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
      data = fishnet_model,
      start = coef(model_poisson_scaled),
      init.theta = 1,
      control = glm.control(maxit = 200, epsilon = 1e-4)
    )
  }, error = function(e2) {
    cat("Still failing. Trying with subset of data to estimate theta...\n")
    
    # Remove extreme outliers to get initial estimates
    fishnet_subset <- fishnet_model %>%
      filter(countBuildings < quantile(countBuildings, 0.95))
    
    model_subset <- glm.nb(
      countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
      data = fishnet_subset
    )
    
    # Use subset estimates for full model
    glm.nb(
      countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
      data = fishnet_model,
      start = coef(model_subset),
      init.theta = model_subset$theta,
      control = glm.control(maxit = 200, epsilon = 1e-4)
    )
  })
})
Trying NB model with scaled predictors...
Code
# Summary
summary(model_nb)

Call:
glm.nb(formula = countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled, 
    data = fishnet_model, control = glm.control(maxit = 100), 
    init.theta = 3.469696575, link = log)

Coefficients:
                       Estimate Std. Error z value            Pr(>|z|)    
(Intercept)             1.08864    0.02663   40.88 <0.0000000000000002 ***
buildings.nn_scaled    -3.79635    0.05148  -73.75 <0.0000000000000002 ***
dist_to_hotspot_scaled -0.48137    0.01711  -28.13 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(3.4697) family taken to be 1)

    Null deviance: 21692.6  on 2890  degrees of freedom
Residual deviance:  2658.7  on 2888  degrees of freedom
AIC: 16847

Number of Fisher Scoring iterations: 1

              Theta:  3.470 
          Std. Err.:  0.124 

 2 x log-likelihood:  -16839.458 
Code
# Compare AIC (lower is better)
cat("\nModel Comparison:\n")

Model Comparison:
Code
cat("Poisson AIC:", round(AIC(model_poisson), 1), "\n")
Poisson AIC: 33755.3 
Code
cat("Negative Binomial AIC:", round(AIC(model_nb), 1), "\n")
Negative Binomial AIC: 16847.5 
Code
cat("\n✓ Model fitted successfully\n")

✓ Model fitted successfully

The lower AIC identifies the better-fitting model: the Negative Binomial's AIC (16,847) is roughly half the Poisson's (33,755). This large gap reflects severe overdispersion: the variance of the building counts (2,084) is about 78 times the mean (26.8), strongly violating the Poisson assumption that the variance equals the mean. The Negative Binomial model, which estimates a separate dispersion parameter, is therefore the clearly superior choice here. Such overdispersion is typical of urban abandonment data, where most cells have few or no abandoned buildings while a handful of hot spot cells have very high concentrations.
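The AIC comparison can be backed by a direct diagnostic. As a sketch (using the `model_poisson` fit from above), the Pearson chi-square statistic divided by the residual degrees of freedom should be near 1 for a well-specified Poisson model, and values far above 1 signal overdispersion:

```r
# Pearson dispersion ratio: sum of squared Pearson residuals over the
# residual degrees of freedom. A value near 1 is consistent with the
# Poisson assumption; a value far above 1 indicates overdispersion and
# supports the Negative Binomial choice.
pearson_chisq <- sum(residuals(model_poisson, type = "pearson")^2)
dispersion_ratio <- pearson_chisq / model_poisson$df.residual
cat("Pearson dispersion ratio:", round(dispersion_ratio, 2), "\n")
```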

Part 6: Spatial Cross-Validation

Implement Leave-One-Group-Out CV

Code
# Get list of districts
districts <- unique(fishnet_model$District)

# Storage for results
cv_results <- tibble()

cat("Running Leave-One-Group-Out Cross-Validation...\n")
Running Leave-One-Group-Out Cross-Validation...
Code
cat("(", length(districts), "folds - one per district)\n\n")
( 23 folds - one per district)
Code
# Loop through each district
for(i in 1:length(districts)) {
  test_district <- districts[i]
  
  # Split data
  train_data <- fishnet_model %>% filter(District != test_district)
  test_data <- fishnet_model %>% filter(District == test_district)
  
  # Fit model on training data with robust error handling
  model_cv <- tryCatch({
    glm.nb(
      countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
      data = train_data,
      control = glm.control(maxit = 100)
    )
  }, error = function(e) {
    # Fallback: use Poisson starting values
    model_poisson_cv <- glm(
      countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
      data = train_data,
      family = poisson()
    )
    
    tryCatch({
      glm.nb(
        countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
        data = train_data,
        start = coef(model_poisson_cv),
        init.theta = 1,
        control = glm.control(maxit = 100)
      )
    }, error = function(e2) {
      # Last resort: use subset approach
      train_subset <- train_data %>%
        filter(countBuildings < quantile(countBuildings, 0.95))
      
      model_subset <- glm.nb(
        countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
        data = train_subset
      )
      
      glm.nb(
        countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
        data = train_data,
        start = coef(model_subset),
        init.theta = model_subset$theta,
        control = glm.control(maxit = 100)
      )
    })
  })
  
  # Predict on test data
  test_data <- test_data %>%
    mutate(
      prediction = predict(model_cv, test_data, type = "response")
    )
  
  # Calculate error metrics
  mae <- mean(abs(test_data$countBuildings - test_data$prediction))
  rmse <- sqrt(mean((test_data$countBuildings - test_data$prediction)^2))
  
  # Store results
  cv_results <- bind_rows(
    cv_results,
    tibble(
      fold = i,
      test_district = test_district,
      n_test = nrow(test_data),
      mae = mae,
      rmse = rmse
    )
  )
  
  cat("  Fold", i, "/", length(districts), "- District", test_district, 
      "- MAE:", round(mae, 2), "\n")
}
  Fold 1 / 23 - District 5 - MAE: 17.77 
  Fold 2 / 23 - District 4 - MAE: 9.2 
  Fold 3 / 23 - District 22 - MAE: 12.62 
  Fold 4 / 23 - District 31 - MAE: 1.08 
  Fold 5 / 23 - District 6 - MAE: 22.23 
  Fold 6 / 23 - District 8 - MAE: 10.16 
  Fold 7 / 23 - District 7 - MAE: 70.75 
  Fold 8 / 23 - District 3 - MAE: 22.32 
  Fold 9 / 23 - District 2 - MAE: 9.29 
  Fold 10 / 23 - District 9 - MAE: 15 
  Fold 11 / 23 - District 10 - MAE: 13.82 
  Fold 12 / 23 - District 1 - MAE: 1.76 
  Fold 13 / 23 - District 12 - MAE: 9.04 
  Fold 14 / 23 - District 15 - MAE: 21.81 
  Fold 15 / 23 - District 11 - MAE: 32.54 
  Fold 16 / 23 - District 18 - MAE: 1.42 
  Fold 17 / 23 - District 25 - MAE: 14.23 
  Fold 18 / 23 - District 14 - MAE: 12.21 
  Fold 19 / 23 - District 19 - MAE: 3.51 
  Fold 20 / 23 - District 16 - MAE: 2.12 
  Fold 21 / 23 - District 17 - MAE: 2.93 
  Fold 22 / 23 - District 20 - MAE: 1.98 
  Fold 23 / 23 - District 24 - MAE: 2.6 
Code
# Overall results
cat("\n✓ Cross-Validation Complete\n")

✓ Cross-Validation Complete
Code
cat("Mean MAE:", round(mean(cv_results$mae), 2), "\n")
Mean MAE: 13.49 
Code
cat("Mean RMSE:", round(mean(cv_results$rmse), 2), "\n")
Mean RMSE: 21.48 

Testing our model by holding out one police district at a time, training on the rest, and seeing how well we predict the held-out district.

Cross-Validation Results

Code
# Show results by district
cv_results %>%
  arrange(desc(mae)) %>%
  kable(
    digits = 2,
    caption = "LOGO CV Results by District",
    col.names = c("Fold", "District", "N Test Cells", "MAE", "RMSE")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
LOGO CV Results by District
Fold District N Test Cells MAE RMSE
7 7 86 70.75 96.35
15 11 79 32.54 48.31
8 3 86 22.32 34.01
5 6 109 22.23 30.62
14 15 60 21.81 34.21
1 5 170 17.77 33.61
10 9 167 15.00 27.44
17 25 135 14.23 20.45
11 10 111 13.82 23.45
3 22 187 12.62 22.26
18 14 84 12.21 19.07
6 8 281 10.16 18.74
9 2 99 9.29 15.71
2 4 325 9.20 22.08
13 12 127 9.04 15.44
19 19 116 3.51 7.12
21 17 126 2.93 4.94
23 24 76 2.60 3.67
20 16 224 2.12 3.70
22 20 61 1.98 3.28
12 1 71 1.76 4.14
16 18 74 1.42 3.00
4 31 37 1.08 2.55

Spatial cross-validation (LOGO CV) is more appropriate than random CV because of spatial autocorrelation: nearby cells tend to have similar values, so randomly assigned folds would place a test cell's neighbors in the training set and inflate apparent accuracy. District 7 had the highest MAE and RMSE, indicating that its abandonment patterns are not well captured by the model's spatial features, or that it differs substantially from the other districts in the training data.
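The autocorrelation that motivates LOGO CV can be quantified directly. A sketch, reusing the `weights` list built in Part 4: a global Moran's I test on the raw cell counts measures how strongly neighboring cells resemble one another. A statistic near 0 would indicate spatial randomness, while a large positive value with a small p-value confirms the clustering that makes randomly assigned folds overly optimistic.

```r
# Global Moran's I on the cell counts, reusing the k-nearest-neighbor
# spatial weights from Part 4. A significant positive statistic means
# nearby cells carry redundant information, so random folds would leak.
moran.test(fishnet$countBuildings, weights, zero.policy = TRUE)
```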

Part 7: Model Evaluation

Generate Final Predictions

Code
# First, add scaled predictors to the full fishnet for predictions
fishnet <- fishnet %>%
  mutate(
    buildings.nn_scaled = (buildings.nn - mean(fishnet_model$buildings.nn, na.rm = TRUE)) / 
                         sd(fishnet_model$buildings.nn, na.rm = TRUE),
    dist_to_hotspot_scaled = (dist_to_hotspot - mean(fishnet_model$dist_to_hotspot, na.rm = TRUE)) / 
                              sd(fishnet_model$dist_to_hotspot, na.rm = TRUE)
  )

# Fit final model on all data with robust error handling
final_model <- tryCatch({
  cat("Fitting final NB model...\n")
  glm.nb(
    countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
    data = fishnet_model,
    control = glm.control(maxit = 100)
  )
}, error = function(e) {
  cat("Using Poisson starting values for final model...\n")
  
  model_poisson_final <- glm(
    countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
    data = fishnet_model,
    family = poisson()
  )
  
  tryCatch({
    glm.nb(
      countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
      data = fishnet_model,
      start = coef(model_poisson_final),
      init.theta = 1,
      control = glm.control(maxit = 200)
    )
  }, error = function(e2) {
    # Last resort
    fishnet_subset <- fishnet_model %>%
      filter(countBuildings < quantile(countBuildings, 0.95))
    
    model_subset <- glm.nb(
      countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
      data = fishnet_subset
    )
    
    glm.nb(
      countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
      data = fishnet_model,
      start = coef(model_subset),
      init.theta = model_subset$theta,
      control = glm.control(maxit = 200)
    )
  })
})
Fitting final NB model...
Code
# Create prediction data frame matching fishnet structure
pred_data <- fishnet %>%
  st_drop_geometry() %>%
  dplyr::select(uniqueID, buildings.nn_scaled, dist_to_hotspot_scaled)

# Add predictions to fishnet
fishnet <- fishnet %>%
  mutate(
    prediction_nb = predict(final_model, pred_data, type = "response")
  )

cat("✓ Final predictions generated\n")
✓ Final predictions generated

Compare to KDE Baseline

Code
# Create KDE surface as baseline comparison
# Convert to spatstat format
buildings_ppp <- as.ppp(st_coordinates(buildings), W = as.owin(chicagoBoundary))

# Calculate KDE
kde_buildings <- density.ppp(buildings_ppp, sigma = 1000)

# Convert back to raster and extract to fishnet
kde_raster <- rast(kde_buildings)
fishnet <- fishnet %>%
  mutate(
    kde_value = terra::extract(kde_raster, st_centroid(fishnet))[,2]
  )

# Normalize KDE to same scale as counts
kde_sum <- sum(fishnet$kde_value, na.rm = TRUE)
count_sum <- sum(fishnet$countBuildings, na.rm = TRUE)
fishnet <- fishnet %>%
  mutate(
    prediction_kde = (kde_value / kde_sum) * count_sum
  )
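The rescaling step matters because raw KDE values are unitless densities, not counts; scaling them to sum to the observed total makes the two surfaces directly comparable. A minimal sketch of the same normalization on a toy vector (hypothetical numbers):

```r
# Toy KDE values for four cells and an observed citywide total
kde_value   <- c(0.2, 1.5, 3.1, 0.7)
total_count <- 100

# Rescale so the KDE "predictions" sum to the observed count total
prediction_kde <- (kde_value / sum(kde_value)) * total_count

sum(prediction_kde)  # sums back to total_count
```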

Visualize Predictions

Code
# Create three maps
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = countBuildings), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "plasma", limits = c(0, 50)) +
  labs(title = "Actual Building Reports") +
  theme_buildings()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_nb), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 50)) +
  labs(title = "Model Predictions (Neg. Binomial)") +
  theme_buildings()

p3 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_kde), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 50)) +
  labs(title = "KDE Baseline Predictions") +
  theme_buildings()

p1 + p2 + p3 +
  plot_annotation(
    title = "Actual vs. Predicted Vacant & Abandoned Buildings"
  )

Model Performance Comparison

Code
# Calculate performance metrics
comparison <- fishnet %>%
  st_drop_geometry() %>%
  filter(!is.na(prediction_nb), !is.na(prediction_kde)) %>%
  summarize(
    model_mae = mean(abs(countBuildings - prediction_nb)),
    model_rmse = sqrt(mean((countBuildings - prediction_nb)^2)),
    kde_mae = mean(abs(countBuildings - prediction_kde)),
    kde_rmse = sqrt(mean((countBuildings - prediction_kde)^2))
  )

comparison %>%
  pivot_longer(everything(), names_to = "metric", values_to = "value") %>%
  separate(metric, into = c("approach", "metric"), sep = "_") %>%
  pivot_wider(names_from = metric, values_from = value) %>%
  kable(
    digits = 2,
    caption = "Model Performance Comparison",
    col.names = c("Approach", "MAE", "RMSE")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Model Performance Comparison

Approach    MAE     RMSE
model       13.50   28.35
kde         15.66   27.79

The Negative Binomial model beats the KDE baseline on MAE (13.50 versus 15.66), meaning it is off by about two fewer buildings per cell, a modest 14% improvement. Note, however, that the KDE slightly edges out the model on RMSE (27.79 versus 28.35), so the model's gain comes from typical cells rather than the extremes. While the performance gain isn't huge, the model provides insight into why certain areas are at risk by showing that proximity to abandoned buildings and hot spots matters for prediction. Both approaches struggle in the same places, particularly underpredicting the most intense abandonment areas in the central west corridor and overpredicting in moderate zones.
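The split verdict between MAE and RMSE is expected behavior for the two metrics: RMSE penalizes large misses quadratically, so a predictor with many small errors can beat one with a few big ones on RMSE while losing on MAE. A toy illustration (hypothetical numbers):

```r
actual <- c(0, 2, 5, 40)
pred_a <- c(0, 2, 5, 28)   # three perfect cells, one big miss (error 12)
pred_b <- c(4, 6, 9, 44)   # a consistent moderate miss of 4 everywhere

mae  <- function(a, p) mean(abs(a - p))
rmse <- function(a, p) sqrt(mean((a - p)^2))

mae(actual, pred_a)   # 3
mae(actual, pred_b)   # 4
rmse(actual, pred_a)  # 6
rmse(actual, pred_b)  # 4
```

Predictor A wins on MAE but loses on RMSE, mirroring how the NB model beats the KDE on MAE while trailing slightly on RMSE.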

Map Prediction Errors

Code
# Calculate errors
fishnet <- fishnet %>%
  mutate(
    error_nb = countBuildings - prediction_nb,
    abs_error_nb = abs(error_nb)
  )

# Map signed errors (positive = underpredicted, negative = overpredicted)
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = error_nb), color = NA) +
  scale_fill_gradient2(
    name = "Error",
    low = "#2166ac", mid = "white", high = "#b2182b",
    midpoint = 0,
    limits = c(-20, 20)
  ) +
  labs(title = "Model Errors (Actual - Predicted)",
       subtitle = "Red = underpredicted, Blue = overpredicted") +
  theme_buildings()

# Map absolute errors
p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = abs_error_nb), color = NA) +
  scale_fill_viridis_c(name = "Abs. Error", option = "magma") +
  labs(title = "Absolute Model Errors") +
  theme_buildings()

p1 + p2

The error maps show a fairly balanced mix of underprediction (red) and overprediction (blue) scattered throughout the city, with no strong systematic spatial bias in the signed errors. However, the absolute error map reveals that the largest mistakes concentrate in specific hotspot areas in the central and western parts of Chicago, indicating the model struggles most with neighborhoods experiencing the most severe abandonment where counts are highest and most variable.
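The visual claim of "no strong systematic spatial bias" could be formalized with a Moran's I test on the model residuals using the fishnet's contiguity structure. The sketch below runs the test on a synthetic grid with spdep; `cell2nb` stands in for the fishnet's queen-contiguity neighbors, and spatially random "residuals" should show no significant autocorrelation:

```r
library(spdep)

set.seed(5080)

# Queen-contiguity neighbor list for a 10 x 10 grid (fishnet stand-in)
nb <- cell2nb(10, 10, type = "queen")
lw <- nb2listw(nb, style = "W")

# Spatially random residuals: Moran's I should be near its expectation
resid_sim <- rnorm(100)
mt <- moran.test(resid_sim, lw)

mt$estimate[["Moran I statistic"]]
mt$p.value
```

Applied to `error_nb` with the fishnet's actual weights, a non-significant result would corroborate the map reading; a significant positive I would indicate spatially clustered errors and a missing spatial covariate.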

Conclusion

Model Performance:

  • The Negative Binomial model significantly outperformed Poisson (AIC: 16,847 vs 33,755), confirming severe overdispersion in the data where variance far exceeds the mean

  • Cross-validation showed reasonable generalization with District 7 being the hardest to predict, indicating some districts have unique abandonment patterns

  • The model slightly outperformed the KDE baseline (MAE: 13.50 vs 15.66), achieving about 14% improvement in prediction accuracy

Spatial Patterns:

  • Abandoned buildings show strong spatial clustering, with most of Chicago having relatively short distances to abandoned buildings

  • Hot spots concentrated in west-central Chicago form the core abandonment cluster, with concentric rings showing spillover effects

  • Local Moran’s I identified statistically significant hot spots (High-High clusters) primarily in western neighborhoods

Model Limitations:

  • The model underpredicts in the most extreme abandonment areas (central west hotspot) and overpredicts in moderate zones

  • Both the regression model and KDE struggle with the same high-intensity areas, suggesting extreme abandonment is driven by factors beyond spatial proximity

  • Adding features such as economic indicators, foreclosure rates, housing age, or policy interventions could help address the current systematic biases